[multi-gpu] Phase 7: aircc integration (--multi-gpu flag) by erwei-xilinx · Pull Request #1582 · Xilinx/mlir-air

erwei-xilinx · 2026-05-03T20:04:50Z

Stacked on #1576 (Phase 1) through #1581 (Phase 6).

What this PR does

Adds a --multi-gpu flag to aircc that runs the host-only multi-GPU compilation pipeline:

1. air-cross-rank-dma-to-mgpu (Phase 5)
2. air-gpu-channel-to-mgpu (Phase 6)
3. air-symmetric-alloc-to-mgpu (Phase 4)
4. air-rank-to-mgpu (Phase 3)
5. Standard LLVM finalization

The output of aircc --multi-gpu is a fully-lowered MLIR module that can be executed by mlir-runner as N processes, with cross-rank communication routed through the symmetric-heap runtime.

Files

tools/aircc/aircc.cpp — adds --multi-gpu flag and runMultiGpuCompilation() entry point.
test/gpu/symmetric_heap_dma/run.sh — adds INPUT=prelowered selector that takes a pre-lowered file (e.g. aircc --multi-gpu output) via SRC=path env var and skips re-lowering.

Test plan

End-to-end validation on rad-mi325x-1 (8× MI325X, gfx942, ROCm 7.1.1, fully-connected XGMI), each rank pinned to its own GPU:

| INPUT | W=2 | W=4 |
|---|---|---|
| handwritten (Phase 2 baseline: kernel-driven) | ✅ | ✅ |
| rank (Phase 3) | ✅ | ✅ |
| alloc (Phase 4) | ✅ | ✅ |
| dma (Phase 5) | ✅ | ✅ |
| channel (Phase 6) | ✅ | ✅ |

The prelowered selector exercises the aircc --multi-gpu path by feeding pre-lowered MLIR through the multi-process driver.

🤖 Generated with Claude Code

Before writing any lowering pass, prove the symmetric-heap runtime works end-to-end from MLIR by hand-writing the IR that future passes should emit. This locks down the lowered shape, surfaces ABI gaps early, and provides a reference oracle for diff-testing the upcoming air-rank-to-mgpu / cross-rank-DMA / channel-on-GPU passes.

- `test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir`: hand-written reference IR. Each rank: init heap, alloc symmetric buffer, fill with (rank+1).0, barrier, read peer's buffer via `mgpuGetHeapBases()[peer]`, D2D into local copy, D2H readback, verify, print PASS/FAIL.
- `test/gpu/symmetric_heap_dma/run.sh`: driver that lowers the IR with `mlir-opt`, then forks N processes with RANK/WORLD_SIZE/LOCAL_RANK env vars set and runs `mlir-runner` in each. `SHARE_GPU=1` env makes all ranks share GPU 0 for testing on single-GPU hosts.

- ✅ Verified end-to-end on rad-mi300a-sh5-1 (1×MI300A, ROCm 7.1.1) with `SHARE_GPU=1` and 2 ranks: rank 0 sees `2.0` from rank 1, rank 1 sees `1.0` from rank 0.
- ⚠️ rad-mi300x-1 (8×MI300X, ROCm 6.4.0) hits a runtime-side crash inside libamdhip64.so during `establishPeerAccess()`. The same crash reproduces with the existing C++ baseline `test/gpu/test_symmetric_heap.cpp`, so it is a pre-existing runtime/HIP issue unrelated to this change.

No runtime ABI gaps for Phases 3-7. The full lowering pipeline can be built using only existing exports: `mgpuSymmetricHeapInit/Destroy`, `mgpuGetRank/WorldSize`, `mgpuSymmetricAlloc/Free`, `mgpuGetHeapBases`, `mgpuBarrier`, `mgpuMemcpy` (D2D for cross-rank reads: direct kernel read from peer-VA isn't supported on some chipsets, so D2D-to-local-then-read is the required pattern).

`docs/MultiGPUPlan.md` updated with Phase 2 status section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the SHARE_GPU=1 escape hatch from run.sh. Colocating ranks on a single GPU silently bypasses the symmetric-heap / XGMI path and reports false-positive PASSes, which is exactly what the test exists to validate. Replace it with a precondition check that exits non-zero when fewer GPUs are visible than ranks were requested.

Validated on rad-mi325x-1 (8x MI325X) at WORLD_SIZE=2,4,8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduce an AIR primitive for the symmetric-heap pointer rebase, in preparation for the kernel-driven producer/consumer redesign per @mawad-amd's review feedback on PR Xilinx#1577.

    %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

    byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
    peer_aligned_ptr = src_aligned_ptr + byte_diff (i8 GEP)
    build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
    unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect) with no runtime calls, and is therefore valid both at host scope and inside `gpu.func`, provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
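The rebase arithmetic described in this commit can be sketched in a few lines of Python. This is only a host-side model under stated assumptions: pointers are plain integers, the base addresses are made up, and the function name `translate` is illustrative rather than anything exported by the runtime.

```python
from typing import Sequence


def translate(src_ptr: int, from_rank: int, to_rank: int,
              heap_bases: Sequence[int]) -> int:
    """Rebase src_ptr from from_rank's heap slice onto to_rank's slice."""
    if from_rank == to_rank:
        # Mirrors the fold: translating within one rank is the identity.
        return src_ptr
    byte_diff = heap_bases[to_rank] - heap_bases[from_rank]
    return src_ptr + byte_diff


# Hypothetical base addresses for a 2-rank world.
bases = [0x1000_0000, 0x5000_0000]
local = bases[0] + 0x40              # allocation 64 bytes into rank 0's heap
peer = translate(local, 0, 1, bases)  # same offset, inside rank 1's heap
```

Because the offset within the heap is preserved, translating back with the ranks swapped returns the original pointer.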

Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated mgpuMemcpy reference test with a kernel-driven producer/consumer pair. Cross-rank data movement is now performed by GPU compute units issuing loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:
- air_sym_handwritten.mlir is rewritten as one gpu.module with two gpu.func kernels:
  * producer (rank 0): each thread writes 42.0 into rank 1's `data` via memref.store on a peer memref produced by air.translate. Lane 0 of each warp signals the per-warp flag with a release atomicrmw on rank 1's `flags`.
  * consumer (rank 1): lane 0 of each warp spins on its flag with an acquire atomic load until producer signals; gpu.barrier then releases all 64 lanes to read their data slot and copy it into a verify buffer. Host D2H reads verify_buf and checks 42.0.

  The host driver (func.func @main) initializes the symmetric heap, copies heap_bases into a device-resident buffer (workaround for the fact that mgpuGetHeapBases returns a host pointer), and dispatches the producer or consumer kernel based on rank.
- run.sh adds the GPU compilation chain (rocdl-attach-target, convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region, gpu-to-llvm) before mlir-runner.
- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each rank sees only its own GPU as device 0. This eliminates the device-binding ambiguity between airgpu's hipSetDevice and MLIR's built-in gpu.launch_func handling that would otherwise cause rank N>0 to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):

    W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
    W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
    W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine) have been observed driving cross-rank data movement over XGMI in this stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
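The write-then-signal / wait-then-read ordering of the producer and consumer kernels can be modelled on the host with Python threads. This is a loose analogy, not the GPU code: `threading.Event` stands in for the release/acquire flag, and the sizes (4 warps of 64 lanes, 256 elements) are taken from the commit messages in this PR.

```python
import threading

WARPS, LANES = 4, 64
data = [0.0] * (WARPS * LANES)        # stands in for rank 1's `data` buffer
flags = [threading.Event() for _ in range(WARPS)]  # one flag per warp
verify = [0.0] * (WARPS * LANES)      # stands in for verify_buf


def producer() -> None:
    for w in range(WARPS):
        for lane in range(LANES):
            data[w * LANES + lane] = 42.0
        # Signal only after the whole slice for this warp has been written.
        flags[w].set()


def consumer() -> None:
    for w in range(WARPS):
        # Block until the producer has published slice w.
        flags[w].wait()
        for lane in range(LANES):
            i = w * LANES + lane
            verify[i] = data[i]


threads = [threading.Thread(target=consumer),
           threading.Thread(target=producer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The consumer is started first on purpose: it only makes progress once each flag is set, which is the property the per-warp flag is meant to guarantee.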

Two CI fixes:

1. air_translate_to_llvm.mlir: add `// REQUIRES: gpu`. The pass `--air-translate-to-llvm` is only registered when AIR_ENABLE_GPU=ON (it lives in the gpu-only conversion-pass set). Without the gate the test fails on non-GPU builds with

       air-opt: Unknown command line argument '--air-translate-to-llvm'

   This matches the pattern already used by the sibling tests air_to_rocdl.mlir and air_gpu_outlining.mlir.
2. AIRTranslateToLLVMPass.{h,cpp}: clang-format-17 reflow. The header banner had a too-long filename which clang-format wrapped into a broken two-line banner ("//===- ...PASS.h ----*- C++\n//-*-===//"), and a few function calls in the .cpp wanted slightly different wrapping. Match the surrounding header-banner convention (80 cols wide) and accept the .cpp reflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address layer-violation feedback: air.translate's $heap_bases operand was typed as !llvm.ptr, mixing LLVM dialect into a high-level AIR op signature (the only AIR op that did so). The right MLIR-native type for "array of pointer-width values in memory" is memref<?xindex>:

- memref expresses the "array in memory" semantic
- index is the pointer-width integer type already used elsewhere (e.g. memref.extract_aligned_pointer_as_index)
- the dynamic ?-dim matches the variable world_size

Op signature changes from:

    air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

to:

    air.translate %src, %from, %to, %bases : memref<NxT, A>, memref<?xindex>

Lowering pass now does memref.load + arith.subi/addi (steps 1-3 below) instead of llvm.getelementptr + llvm.load + llvm.ptrtoint + arith.subi + llvm.getelementptr-i8. The LLVM dialect only appears in step 4 (materialize peer address as !llvm.ptr) and step 5 (build memref descriptor), both unavoidable since memref descriptors *are* LLVM structs.

Host-side wiring: a small wrap_bases(!llvm.ptr, i64) -> memref<?xindex> helper builds a memref descriptor over the device-resident heap_bases buffer once. From there it's a memref everywhere: through gpu.launch_func, into the kernel, into air.translate.

The air_LLVMPtr type-predicate def in AIR.td is removed; AIR.td no longer imports any LLVM-dialect type machinery. The "#include mlir/Dialect/LLVMIR/LLVMTypes.h" in AIRDialect.h is dropped (no AIR op signature uses LLVM types anymore).

Validated on rad-mi325x-1 (8x MI325X, gfx942, ROCm 7.1.1):

    W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
    W=4: ALL 4 RANKS PASSED
    W=8: ALL 8 RANKS PASSED

FileCheck unit tests updated for both the dialect (parser/printer/folder) and the conversion (lowering shape).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Consumer kernel never calls air.translate (it reads its OWN local data, which the producer wrote remotely from the producer side). So the %bases : memref<?xindex> arg in @consumer was unused. Drop it from both the kernel signature and the host-side gpu.launch_func arg list.
- Both kernels declared %c1 = arith.constant 1 : index but neither actually used it. Drop.

Verified on rad-mi325x-1 at W=2/4/8: consumer still PASSes with verify[0]=42.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three pieces of review feedback on the handwritten test:

1. Validation theater. The verify branch only checked element 0 and only ever printed PASS; msg_fail was declared but never referenced. A bug that signalled flag[0] but failed to write warps 1..3's slice would still pass. Now: scf.for over all 256 elements counts mismatches, prints msg_fail with the first one, and on any failure calls exit(1) so run.sh sees a non-zero process exit and reports "SOME RANKS FAILED" (matches the saved no-green-without-validation convention).
2. Atomic syncscope is the silent contract that makes XGMI propagation work. Producer's atomicrmw release and consumer's atomic load acquire emit no syncscope keyword, relying on the LLVM IR default = System scope (cross-device on AMDGPU). New FileCheck test sym_atomic_syncscope.mlir asserts both ops survive convert-gpu-to-rocdl with no syncscope qualifier present, with a block comment explaining the AMDGPU LangRef behavior and linking to the relevant section. The handwritten file's atomic comment blocks now point at this test.
3. Comments throughout were too verbose. Sweeping trim of the file header, kernel sections, helpers, and main: 411 -> 348 lines. Substance unchanged; comments now state the why (or the contract), not the what.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1): W=2/4/8 -> ALL N RANKS PASSED; consumer reports verify[0]=42.0 with the full 256-element check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
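The fail-loud check from item 1 (scan every element, count mismatches, report the first, exit non-zero) looks roughly like this in Python. It is a sketch of the control flow only; the function names and the message text are illustrative and do not come from the MLIR test.

```python
import sys


def count_mismatches(buf, expected):
    """Return (number of mismatches, index of the first one or None)."""
    mismatches = 0
    first_bad = None
    for i, value in enumerate(buf):
        if value != expected:
            if first_bad is None:
                first_bad = i
            mismatches += 1
    return mismatches, first_bad


def verify_or_exit(buf, expected=42.0):
    n, first = count_mismatches(buf, expected)
    if n:
        print(f"FAIL: {n} mismatches, first at index {first} "
              f"(got {buf[first]}, want {expected})")
        sys.exit(1)   # non-zero exit so the launcher can report the failure
    print("PASS")


verify_or_exit([42.0] * 256)
```

Checking only element 0 would miss a buffer whose later slices were never written, which is the failure mode the commit calls out.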

The previous wrap_data / wrap_flags / wrap_bases helpers each hand-built an LLVM memref descriptor struct (!llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>), hardcoding the in-flight memref-to-LLVM ABI three times. An upstream descriptor-layout change would silently break all three.

Collapse to a single wrap_bytes(ptr, size_bytes) -> memref<?xi8> that builds the descriptor once. Use sites do memref.view to retype:

    %data_bytes = wrap_bytes(%data_ptr, %c1024_bytes)
    %data_m = memref.view %data_bytes[%c0][] : memref<?xi8> to memref<256xf32>
    %flags_bytes = wrap_bytes(%flags_ptr, %c16_bytes)
    %flags_m = memref.view %flags_bytes[%c0][] : memref<?xi8> to memref<4xi32>
    %bases_bytes = wrap_bytes(%bases_devptr, %bases_size)
    %bases = memref.view %bases_bytes[%c0][%world_idx] : memref<?xi8> to memref<?xindex>
    ; verify_buf wrapped same way at the consumer

The struct-type literal now appears in exactly one place. memref.view is a standard upstream op with its own well-tested lowering.

Validated on rad-mi325x-1: W=2/4/8 all PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
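The "wrap once as bytes, then retype with a view" idea has a close analogue in Python's `memoryview.cast`. The sketch below is an analogy only: `wrap_bytes` here is a hypothetical Python helper named after the MLIR one, and the buffer sizes (1024 and 16 bytes) are the ones quoted in the commit.

```python
def wrap_bytes(storage: bytearray) -> memoryview:
    """Single place where raw storage becomes a byte view
    (the analogue of building the memref<?xi8> descriptor once)."""
    return memoryview(storage)


storage = bytearray(1024)
data_bytes = wrap_bytes(storage)
data_m = data_bytes.cast("f")        # 1024 bytes viewed as 256 x f32

flags_bytes = wrap_bytes(bytearray(16))
flags_m = flags_bytes.cast("i")      # 16 bytes viewed as 4 x i32

data_m[0] = 42.0                     # writes through to the same storage
```

As with memref.view, the typed views alias the underlying bytes, so only the single wrapping step needs to know how the raw buffer is described.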

…mic gap

The 5-op extract_aligned_pointer_as_index -> index_cast -> inttoptr -> index_cast -> getelementptr sequence was duplicated in producer and consumer kernels. Factor into one private func.func @flag_slot_ptr inside gpu.module @sym_kernels (gpu.module accepts non-kernel funcs; the GPU compilation pipeline compiles them alongside the kernels).

Add a TODO comment explaining the upstream memref dialect gap that forces this descent: memref.atomic_rmw and memref.generic_atomic_rmw lack ordering and syncscope, and there is no memref.atomic_load / memref.atomic_store at all. We need release/acquire + system scope for the cross-XGMI flag handshake, which today only the LLVM dialect exposes. When upstream memref grows ordering+syncscope on its atomic ops, this helper goes away in favor of memref.atomic_rmw / load.

Producer and consumer atomic blocks each shrink from 9 ops to 1 + 1 helper call. Net diff: ~16 lines saved across the file.

Validated on rad-mi325x-1: W=2/4/8 all PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Change the producer's release-atomicrmw and consumer's acquire-atomic-load in air_sym_handwritten.mlir from default (no syncscope qualifier) to `syncscope("")`. The empty string is LLVM IR's canonical spelling of the System scope; this makes the cross-device intent self-documenting at the MLIR level rather than relying on a default-omitted contract.

Behavior is unchanged: `syncscope("")` lowers to LLVM IR identical to the unqualified form (LLVM textual IR omits the `syncscope(...)` token when scope == System), survives `convert-gpu-to-rocdl`, and runs e2e on 2x MI325X (verified on rad-mi325x-1).

Update sym_atomic_syncscope.mlir FileCheck contract test accordingly: assert `syncscope("")` is preserved through the pipeline instead of asserting absence of any syncscope keyword.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The barrier after lane-0's spin-wait on the per-warp flag is unnecessary on AMDGPU:

- Within-wave control sync: lanes execute in SIMT lockstep, so lanes 1..63 of each wave cannot leave the scf.if before lane 0 does.
- Memory visibility: L1 is wave-shared, so lane 0's `syncscope("") acquire` load makes the producer's writes visible to the whole wave without needing a workgroup-level fence.

Verified e2e on 2x MI325X (rad-mi325x-1), 5/5 runs PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ariants

The Phase 2 reference test now ships two parallel kernel-driven examples of the symmetric-heap producer/consumer pattern, each demonstrating a different cross-rank synchronization mechanism on the same outer harness:

- air_sym_handwritten_atomic.mlir: LLVM atomicrmw release (producer) + atomic load acquire (consumer), both with syncscope("") = LLVM System scope = cross-device per AMDGPUUsage. Spec-defined ordering contract; the lowering invariant is pinned by sym_atomic_syncscope.mlir.
- air_sym_handwritten_cacheline.mlir: Cache-line atomicity: producer writes 32 i32 (one 128-byte line) in a single vec store with the flag in-band at lane 31; consumer spins via gpu.shuffle of lane 31 until flag==1. No atomics, no fences. Trades the LLVM-spec contract for a microarchitectural one (relies on gfx940 vec-store cache-line atomicity and XGMI publishing peer cache lines whole on MI300).

run.sh now accepts INPUT=atomic|cacheline (default cacheline). The two files share the mgpu* host harness, the wrap_bytes helper, and the heap-init / verify_buf D2H readback / fail-loud exit pattern; only the cross-rank handoff differs.

Both verified on 2x MI325X (rad-mi325x-1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The handwritten cross-rank symmetric-heap test fundamentally needs a producer + a consumer process; world_size=1 has no peer to talk to. The old %is_solo branch printed a "skipping" message and exited 0, which is worse than useless now that we have real multi-GPU CI: a misconfigured single-process launch would be reported as a green test even though nothing was exercised.

Replace the graceful skip with a fail-loud precondition at the launcher boundary (run.sh) and remove the corresponding MLIR-level branch:

- run.sh now refuses NUM_RANKS < 2 with a clear ERROR + exit 1, matching the existing pattern for NUM_GPUS < NUM_RANKS.
- Both air_sym_handwritten_{atomic,cacheline}.mlir lose the %is_solo if/else wrapping; rank-dispatch (producer/consumer/idle) is now at the top level. The @msg_only1 global is removed.

Verified on 2x MI325X: INPUT=atomic PASS, INPUT=cacheline PASS, `bash run.sh 1` refused at the launcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
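The launcher-side preconditions and the per-rank environment described across these commits can be summarized in a short Python model. This is not the contents of run.sh: `plan_launch` is a hypothetical helper, and only the variable names (RANK, WORLD_SIZE, LOCAL_RANK, HIP_VISIBLE_DEVICES) and the two refusal conditions are taken from the commit messages.

```python
from typing import Dict, List


def plan_launch(num_ranks: int, num_gpus: int) -> List[Dict[str, str]]:
    """Return one environment overlay per rank, or raise if the launch
    could not exercise a real cross-GPU path."""
    if num_ranks < 2:
        raise ValueError("ERROR: need at least 2 ranks (producer + consumer)")
    if num_gpus < num_ranks:
        raise ValueError(
            f"ERROR: {num_ranks} ranks requested but only {num_gpus} GPUs visible")
    return [
        {
            "RANK": str(i),
            "WORLD_SIZE": str(num_ranks),
            # Each process sees exactly one GPU, which it addresses as device 0.
            "HIP_VISIBLE_DEVICES": str(i),
            "LOCAL_RANK": "0",
        }
        for i in range(num_ranks)
    ]


envs = plan_launch(num_ranks=2, num_gpus=8)
```

A real launcher would merge each overlay into the parent environment and start one `mlir-runner` process per entry; that part is omitted here.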

New conversion pass that replaces each `air.rank` op by inlining its body in place, with rank IDs computed at runtime via `mgpuGetRank()` and delinearized into the rank's N-D iteration space.

Replaces `air-rank-to-launch` for the GPU pipeline (which serialized ranks via scf.for, a placeholder for single-process execution). After this pass each process executes the entire `air.rank` body once, with its rank id resolved dynamically from the runtime.

Heap lifecycle (`mgpuSymmetricHeapInit` / `mgpuSymmetricHeapDestroy`) is bracketed around the parent function once per function (not per rank).

- `mlir/include/air/Conversion/AIRRankToMgpuPass.h`: public header
- `mlir/include/air/Conversion/GPUPasses.td`: `air-rank-to-mgpu` def with `heap-size` option (default 256 MB)
- `mlir/include/air/Conversion/GPUPassDetail.h`: `GEN_PASS_DEF_AIRRANKTOMGPU`
- `mlir/lib/Conversion/AIRRankToMgpuPass.cpp`: pass implementation
- `mlir/lib/Conversion/CMakeLists.txt`, `Passes.cpp`: registration
- `mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir`: FileCheck unit tests (10 cases; see Test plan below)
- `test/gpu/symmetric_heap_dma/air_sym_with_rank.mlir`: high-level air.rank-based equivalent of the Phase 2 hand-written reference
- `test/gpu/symmetric_heap_dma/run.sh`: `INPUT=rank|handwritten` selector to run either form through the same multi-process driver

FileCheck unit tests cover:
- 1D / 2D rank delinearization (remsi/divsi)
- Default + custom heap-size option
- Async form (token replacement via wait_all)
- Async dependencies (blocking wait_all insertion)
- Multiple `air.rank` ops per function (init/destroy emitted once)
- Multiple `func.return` paths (destroy before each)
- Kernel operand mapping (block args replaced by SSA operands)
- Idempotent extern decls across multiple functions
- No-op when no `air.rank` is present (audit-found bug fixed: pass was unconditionally inserting decls)

End-to-end: rad-mi300a-sh5-1, SHARE_GPU=1, 2 ranks, INPUT=rank: both ranks PASS the cross-rank read.

Caveat: same SHARE_GPU=1 single-physical-GPU caveat as Phase 2. True multi-GPU re-validation is needed before declaring multi-GPU production-ready (blocked on ROCm-side work).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
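Delinearizing a flat rank id into N-D coordinates with remainder and division can be sketched as follows. The dimension order is an assumption: this version treats the last dimension as the fastest-varying one, which the commit message does not specify, and `delinearize` is an illustrative name.

```python
from typing import Sequence, Tuple


def delinearize(rank: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Split a flat rank id into per-dimension ids.

    Assumption: the last dimension varies fastest (row-major order).
    """
    ids = []
    for extent in reversed(shape):
        ids.append(rank % extent)   # remainder picks the id in this dimension
        rank //= extent             # quotient carries on to the next dimension
    return tuple(reversed(ids))


coords_1d = delinearize(3, (4,))
coords_2d = delinearize(5, (2, 4))
```

For non-negative rank ids, Python's `%` and `//` agree with signed remainder and division, so the arithmetic matches the remsi/divsi form named in the test list.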

New conversion pass that replaces `memref.alloc` carrying the unit attribute `air.symmetric` with a call to `mgpuSymmetricAlloc(size, stream)`. The returned `!llvm.ptr` is wrapped in an LLVM memref descriptor (struct) and projected back to the original memref type via `builtin.unrealized_conversion_cast` so downstream uses keep working through the standard `convert-to-llvm` pipeline.

`memref.dealloc` ops whose operand traces back (through the cast) to a symmetric alloc are rewritten to `mgpuSymmetricFree`. The pass is a no-op when no `air.symmetric` allocations are present.

- `mlir/include/air/Conversion/AIRSymmetricAllocToMgpuPass.h`: header
- `mlir/include/air/Conversion/GPUPasses.td`: `air-symmetric-alloc-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h`: `GEN_PASS_DEF_AIRSYMMETRICALLOCTOMGPU`
- `mlir/lib/Conversion/AIRSymmetricAllocToMgpuPass.cpp`: implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}`: registration
- `mlir/test/Conversion/AIRSymmetricAllocToMgpu/symmetric_alloc.mlir`: FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_alloc.mlir`: high-level e2e using `memref.alloc {air.symmetric}` (Phase 3 + Phase 4 chained)
- `test/gpu/symmetric_heap_dma/run.sh`: `INPUT=alloc` selector

FileCheck unit tests:
- 1D alloc + dealloc shape (size, descriptor, cast, free)
- 2D alloc with row-major strides in descriptor
- Element type byte-size: f32 (4B), f64 (8B), i32 (4B)
- Multiple symmetric allocs share one decl pair
- Pass is a no-op for non-symmetric allocs
- Pass is a no-op when there are zero symmetric allocs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten: PASS (Phase 2 baseline)
- INPUT=rank: PASS (Phase 3)
- INPUT=alloc: PASS (Phase 4: chained Phase 4 + Phase 3 lowering)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
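The two quantities the lowering has to compute for a static-shaped allocation, the byte size passed to the allocator and the row-major strides stored in the descriptor, can be worked out as below. This is a sketch with hypothetical helper names; the element sizes are the ones listed in the FileCheck bullets, and strides are in elements, as in MLIR memref descriptors.

```python
from typing import List, Sequence

# Element sizes named in the unit-test list above.
ELEMENT_BYTES = {"f32": 4, "f64": 8, "i32": 4}


def row_major_strides(shape: Sequence[int]) -> List[int]:
    """Strides in elements: the innermost dimension has stride 1."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def alloc_size_bytes(shape: Sequence[int], elem: str) -> int:
    """Total bytes for a densely packed buffer of the given shape."""
    count = 1
    for extent in shape:
        count *= extent
    return count * ELEMENT_BYTES[elem]


size_1d = alloc_size_bytes([256], "f32")
strides_2d = row_major_strides([4, 8])
```

For example, a 256-element f32 buffer needs 1024 bytes, which matches the `%c1024_bytes` constant used for the `data` buffer elsewhere in this PR.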

New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a `src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side `mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:

    peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via `memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]` gives this rank's symmetric heap base.

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only: no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They can be relaxed in follow-up work.

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h`: header
- `mlir/include/air/Conversion/GPUPasses.td`: `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h`: `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp`: implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}`: registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir`: FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir`: high-level e2e combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh`: adds `INPUT=dma` selector

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi, byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten: PASS (Phase 2 baseline)
- INPUT=rank: PASS (Phase 3)
- INPUT=alloc: PASS (Phase 4)
- INPUT=dma: PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)

Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579: true multi-GPU re-validation is needed before declaring multi-GPU production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
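To make the peer-pointer formula concrete, the following toy model lays two ranks' heaps out in one flat `bytearray` and performs a cross-rank read the way the lowered code is described as doing it: compute the peer address from the local one, then copy. Everything here is hypothetical scaffolding (addresses, buffer offsets, helper names); only the formula itself comes from the commit message.

```python
# Toy flat address space holding both ranks' symmetric heaps.
bases = [0x100, 0x200]          # hypothetical heap base per rank
memory = bytearray(0x300)
SRC_OFF, DST_OFF, SIZE = 0, 16, 16   # same layout on every rank (symmetric)


def peer_ptr(local_ptr: int, my_rank: int, peer_rank: int) -> int:
    # peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
    return bases[peer_rank] + (local_ptr - bases[my_rank])


def cross_rank_read(my_rank: int, src_rank: int) -> None:
    """Copy src_rank's src_buf into my_rank's dst_buf (stand-in for the
    memcpy call the pass emits)."""
    local_src = bases[my_rank] + SRC_OFF
    remote_src = peer_ptr(local_src, my_rank, src_rank)
    dst = bases[my_rank] + DST_OFF
    memory[dst:dst + SIZE] = memory[remote_src:remote_src + SIZE]


# Each rank fills its own src_buf with (rank + 1).
for r in range(2):
    start = bases[r] + SRC_OFF
    memory[start:start + SIZE] = bytes([r + 1]) * SIZE

cross_rank_read(my_rank=1, src_rank=0)   # rank 1 pulls rank 0's data
```

The caller only ever names its own local buffer; the symmetric layout is what lets the same offset be reused inside the peer's heap.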

New conversion pass that lowers `air.channel` ops of `channel_type = "gpu_symmetric_heap"` plus their put/get pair to host-side `mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`, with `mgpuBarrier`-based synchronization.

Per channel:
- put becomes `mgpuBarrier()` (publish: the data is already in the symmetric heap via the put's `air.symmetric` source memref)
- get becomes `mgpuBarrier()` followed by `mgpuMemcpy(dst, peer_va(src), sz)` where the peer rank is the get's first index operand
- The channel symbol itself is erased

This makes `air.channel` of type `gpu_symmetric_heap` syntactic sugar over cross-rank DMA, with the additional benefit of decoupling the producer site (where put appears) from the consumer site (where get appears) via the channel symbol.

- One put and one get per channel symbol
- Both at host scope (no `gpu.launch`/`gpu.func`)
- put's source memref must be `air.symmetric`-tagged
- "Entire memref" form on both sides (no offsets/sizes/strides)
- get must take exactly one index operand (the peer rank)

- `mlir/include/air/Conversion/AIRGpuChannelToMgpuPass.h`: header
- `mlir/include/air/Conversion/GPUPasses.td`: pass def
- `mlir/include/air/Conversion/GPUPassDetail.h`: `GEN_PASS_DEF_AIRGPUCHANNELTOMGPU`
- `mlir/lib/Conversion/AIRGpuChannelToMgpuPass.cpp`: implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}`: registration
- `mlir/test/Conversion/AIRGpuChannelToMgpu/gpu_channel.mlir`: FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_channel.mlir`: high-level e2e
- `test/gpu/symmetric_heap_dma/run.sh`: adds `INPUT=channel` selector

FileCheck unit tests cover:
- Basic put/get pair lowering shape (barrier + mgpuMemcpy with peer-VA)
- Channel symbol is erased after lowering
- Pass is a no-op for non-`gpu_symmetric_heap` channels (e.g., `npu_*`)

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten: PASS
- INPUT=rank: PASS
- INPUT=alloc: PASS
- INPUT=dma: PASS
- INPUT=channel: PASS (chains Phase 6 -> Phase 4 -> Phase 3 -> standard LLVM)

Both ranks publish their src_buf via channel.put, then read rank 0's slot via channel.get. Verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as previous PRs in the stack: true multi-GPU re-validation is needed before declaring multi-GPU production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
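The per-channel rewrite rules can be expressed as a tiny Python function that maps a put/get pair to the sequence of runtime calls described above. This is a toy model of the rewrite, not the pass: ops are plain dictionaries, calls are tuples, and the field names (`kind`, `src`, `dst`, `peer`) are invented for the illustration.

```python
from typing import Dict, List, Tuple


def lower_channel(ops: List[Dict]) -> List[Tuple]:
    """Lower the ops of ONE channel symbol into runtime-call tuples."""
    puts = [op for op in ops if op["kind"] == "put"]
    gets = [op for op in ops if op["kind"] == "get"]
    if len(puts) != 1 or len(gets) != 1:
        raise ValueError("expected exactly one put and one get per channel")
    src = puts[0]["src"]   # the get reads the put's symmetric source buffer

    calls: List[Tuple] = []
    for op in ops:
        if op["kind"] == "put":
            # Data already lives in the symmetric heap; put only publishes.
            calls.append(("mgpuBarrier",))
        else:
            calls.append(("mgpuBarrier",))
            calls.append(("mgpuMemcpy", op["dst"],
                          ("peer_va", src, op["peer"]), "size_bytes"))
    return calls


calls = lower_channel([
    {"kind": "put", "src": "%src_buf"},
    {"kind": "get", "dst": "%dst_buf", "peer": 0},
])
```

The put contributes no copy of its own, which is why the source memref has to be symmetric: the consumer's memcpy reads it in place on the peer.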

Add `--multi-gpu` flag to `aircc` that selects the host-only multi-GPU compilation pipeline:

1. air-cross-rank-dma-to-mgpu (Phase 5)
2. air-gpu-channel-to-mgpu (Phase 6)
3. air-symmetric-alloc-to-mgpu (Phase 4)
4. air-rank-to-mgpu (Phase 3)
5. convert-scf-to-cf + convert-to-llvm + reconcile-unrealized-casts

The output is host-only LLVM IR meant to be run as N processes via `mlir-runner` linked against `libairgpu.so` (and `libmlir_rocm_runtime.so`) with `RANK` / `WORLD_SIZE` / `LOCAL_RANK` env vars set.

The original Phase 7 plan included a `--multi-rank=N` runner mode that forks N processes from `aircc` itself. That has been intentionally deferred: the existing launcher in `test/gpu/symmetric_heap_dma/run.sh` already does the multi-process fork+wait pattern in ~30 lines of shell, and wrapping it into `aircc` adds little value over that. Worth revisiting if real deployment integration (SLURM, MPI, etc.) becomes a requirement.

- `tools/aircc/aircc.cpp`: adds `--multi-gpu` flag and `runMultiGpuCompilation()` function
- `test/gpu/symmetric_heap_dma/run.sh`: adds `INPUT=prelowered SRC=<path>` mode that takes `aircc --multi-gpu` output directly
- `docs/MultiGPUPlan.md`: Phase 7 section updated with the new design

- [x] `aircc --target=gpu --multi-gpu` builds and produces clean LLVM IR matching the structure of what `INPUT=channel` produces in `run.sh`
- [x] Compiled output uses `llvm.func @mgpuSymmetricHeapInit/Destroy/Get*`, `llvm.func @mgpuSymmetricAlloc/Free`, `llvm.func @mgpuMemcpy`, `llvm.func @mgpuBarrier`, `llvm.func @mgpuGetHeapBases` (verified via `head -20`)
- [ ] E2E run-through of `aircc --multi-gpu` output via `run.sh INPUT=prelowered`: deferred because the SLURM allocation expired during testing. The compile pipeline is byte-for-byte equivalent to the manually-invoked `INPUT=channel` pipeline (which we verified PASS), so a regression is unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
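The pass order above can be written down as data, which makes the ordering constraint (the higher-level Phase 5/6 ops are lowered before the alloc and rank lowerings they depend on) easy to check. The sketch below assembles an `air-opt`-style argument list; it assumes every pass is exposed as a `--<name>` flag on `air-opt`, which this page confirms only for `--air-translate-to-llvm`, and `air_opt_argv` is a hypothetical helper, not how `aircc` itself is implemented.

```python
from typing import List, Sequence

# Pass order as listed in the commit message.
MULTI_GPU_PASSES = [
    "air-cross-rank-dma-to-mgpu",    # Phase 5
    "air-gpu-channel-to-mgpu",       # Phase 6
    "air-symmetric-alloc-to-mgpu",   # Phase 4
    "air-rank-to-mgpu",              # Phase 3
    "convert-scf-to-cf",
    "convert-to-llvm",
    "reconcile-unrealized-casts",
]


def air_opt_argv(input_path: str,
                 passes: Sequence[str] = MULTI_GPU_PASSES) -> List[str]:
    """Build an argv list; assumes each pass is a `--<name>` air-opt flag."""
    return ["air-opt", input_path] + [f"--{name}" for name in passes]


argv = air_opt_argv("input.mlir")
```

The list could be handed to `subprocess.run` on a machine with a GPU-enabled build; no such invocation is shown here because its output is not documented in this PR.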

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch 6 times, most recently from b0672bf to 46244e6 Compare May 6, 2026 01:01

erwei-xilinx mentioned this pull request May 6, 2026

[multi-gpu] Phase 1: namespace channel_type, add cross-rank attrs, doc plan #1576

Merged

7 tasks

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch 5 times, most recently from 5ebf653 to 388885a Compare May 6, 2026 16:35

erwei-xilinx and others added 5 commits May 6, 2026 16:37

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from 388885a to 67d0eef Compare May 6, 2026 16:38

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from 67d0eef to c39b1d1 Compare May 6, 2026 17:07

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from bc9f9e2 to 23d70f2 Compare May 6, 2026 18:22

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from 23d70f2 to 6897ab8 Compare May 6, 2026 18:53

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from dd153c9 to bea9bea Compare May 6, 2026 19:02

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from bea9bea to 00dba84 Compare May 6, 2026 20:15

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from 00dba84 to 5a61809 Compare May 12, 2026 15:36

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from 5a61809 to 11e6b7c Compare May 12, 2026 15:38

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from 11e6b7c to 0815f9a Compare May 12, 2026 16:19

erwei-xilinx and others added 6 commits May 12, 2026 17:08

erwei-xilinx force-pushed the multigpu-phase7-aircc-integration branch from 0815f9a to e9a1fc6 Compare May 12, 2026 17:20

erwei-xilinx mentioned this pull request May 12, 2026

[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level #1613

Merged

5 tasks

erwei-xilinx wants to merge 19 commits into Xilinx:main from erwei-xilinx:multigpu-phase7-aircc-integration